Goto

Collaborating Authors

 bullet point



A Training Examples

Neural Information Processing Systems

Market research indicates that there is a significant opportunity for a new co ee bar located in the heart of the downtown business district.


Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design

Jimenez, Maria Lizarazo, Claros, Ana Gabriela, Green, Kieran, Toro-Tobon, David, Larios, Felipe, Asthana, Sheena, Wenczenovicz, Camila, Maldonado, Kerly Guevara, Vilatuna-Andrango, Luis, Proano-Velez, Cristina, Bandi, Satya Sai Sri, Bagewadi, Shubhangi, Branda, Megan E., Zahidy, Misk Al, Luz, Saturnino, Lapata, Mirella, Brito, Juan P., Ponce-Ponte, Oscar J.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.




Intent-Aware Schema Generation And Refinement For Literature Review Tables

Padmakumar, Vishakh, Chang, Joseph Chee, Lo, Kyle, Downey, Doug, Naik, Aakanksha

arXiv.org Artificial Intelligence

The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by generating schemas defining shared aspects along which to compare papers. However, progress on schema generation has been slow due to: (i) ambiguity in reference-based evaluations, and (ii) lack of editing/refinement methods. Our work is the first to address both issues. First, we present an approach for augmenting unannotated table corpora with \emph{synthesized intents}, and apply it to create a dataset for studying schema generation conditioned on a given information need, thus reducing ambiguity. With this dataset, we show how incorporating table intents significantly improves baseline performance in reconstructing reference schemas. We start by comprehensively benchmarking several single-shot schema generation methods, including prompted LLM workflows and fine-tuned models, showing that smaller, open-weight models can be fine-tuned to be competitive with state-of-the-art prompted LLMs. Next, we propose several LLM-based schema refinement techniques and show that these can further improve schemas generated by these methods.


LLM Unlearning Without an Expert Curated Dataset

Zhu, Xiaoyuan, Zhang, Muru, Liu, Ollie, Jia, Robin, Neiswanger, Willie

arXiv.org Artificial Intelligence

Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning-the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets-datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.


R4), relevant to the conference (R2, R4), and is generally on an interesting topic (R1, R2)

Neural Information Processing Systems

We thank the reviewers for their work and for the positive evaluation of our paper. R4), relevant to the conference (R2, R4), and is generally on an interesting topic (R1, R2). Thus, we also provided guarantees for SO without strong convexity. Adding a small amount of regularization is also a common practice for numerical stability. Reviewer 2. We appreciate your support of our paper.


Idiosyncrasies in Large Language Models

Sun, Mingjie, Yin, Yida, Xu, Zhiqiu, Kolter, J. Zico, Liu, Zhuang

arXiv.org Artificial Intelligence

In this work, we unveil and study idiosyncrasies in Large Language Models (LLMs) -- unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, particularly for training on synthetic data and inferring model similarity. Code is available at https://github.com/locuslab/llm-idiosyncrasies.